At the end of 2019, a new viral bacteria took the world by surprise. Unrealized by the world until mid 2020, this virus has made nations rethink how to handle technology, production, policies, medical emergencies, and handling of the economy. Disruptions in the production of common goods have been shot due to policies requiring workers to stay home. Medical researchers to find cures and a policy to allow rapid testing and approvals for mass production was limit tested. A lack of goods brought rare metal productions to a halt, causing technology advancements, including cars, computers, phones, even as simple as a microwave, to be highly limited. With a lack of production, rules requiring distancing, fear of a highly contagious virus spreading around mom-and-pop stores, the global economy was shut down.
Small private shops did what they could to keep afloat, corporations pushed for teleworking options, hospitals were pushed to their limits, and many non-critical government agencies were forced to minimalize their work force to support the global movement on stopping Corona Virus from spreading.
Talk about Covid major dates for timeline….. (Part of final)
Data Montgomery has provided a standardized report of employee salaries for many consecutive years. We will be taking a look into the 2019 to 2022 to find trends in overall employment for Montgomery county, look at how wages have been impacted from Covid’s economic impact, and see if the lack of employment has had any major impact to the overtime usage.
| Column Information | Type | Description |
|---|---|---|
| Department | chr | Department Code |
| Department Name | chr | Name of the Department |
| Division | chr | Division including division code |
| Gender | chr | Gender of employee |
| Base Salary | num | Rate of pay for the given employee |
| Overtime Pay | num | Amount of overtime received in the year |
| Longevity Pay | num | Amount of Longevity pay recieved in the year |
| Grade | chr | Salary grade for the given employee |
General questions: There seems to be some departments with large pay gaps, can we find potential causes?
Has overtime pay overstayed it’s welcome?
How does gender play a role in salary statistics for Montgomery County?
Are we getting to the point where job statistics are getting back to pre-covid events?
pacman::p_load(tidyverse,dplyr,readxl,plotly,ggthemes, viridis)
setwd("~/Data Study/Data 205 MC")
db2019 <- read.csv("Employee_Salaries_2019.csv")
db2020 <- read.csv("Employee_Salaries_2020.csv")
db2021 <- read.csv("Employee_Salaries_2021.csv")
db2022 <- read.csv("Employee_Salaries_2022.csv")
str(db2019)
## 'data.frame': 10105 obs. of 8 variables:
## $ Department : chr "BOA" "BOA" "BOA" "BOE" ...
## $ Department.Name : chr "Board of Appeals Department" "Board of Appeals Department" "Board of Appeals Department" "Board of Elections" ...
## $ Division : chr "Board of Appeals Division" "Board of Appeals Division" "Executive Director" "Director" ...
## $ Gender : chr "F" "F" "F" "F" ...
## $ Base.Salary : num 78902 58482 144751 183654 62488 ...
## $ X2019.Overtime.Pay : num 0 0 0 0 0 0 0 0 0 0 ...
## $ X2019.Longevity.Pay: num 0 0 0 0 0 ...
## $ Grade : chr "18" "16" "M3" "M1" ...
colSums(is.na(db2019))
## Department Department.Name Division Gender
## 0 0 0 0
## Base.Salary X2019.Overtime.Pay X2019.Longevity.Pay Grade
## 0 0 0 0
str(db2020)
## 'data.frame': 9958 obs. of 8 variables:
## $ Department : chr "ABS" "ABS" "ABS" "ABS" ...
## $ Department.Name : chr "Alcohol Beverage Services" "Alcohol Beverage Services" "Alcohol Beverage Services" "Alcohol Beverage Services" ...
## $ Division : chr "Wholesale Administration" "Administrative Services" "Administration" "Wholesale Operations" ...
## $ Gender : chr "F" "F" "M" "F" ...
## $ Base.Salary : num 78902 35926 167345 90848 78902 ...
## $ X2020.Overtime.Pay : num 199 0 0 0 205 ...
## $ X2020.Longevity.Pay: num 0 4039 0 5718 2460 ...
## $ Grade : chr "18" "16" "M2" "21" ...
colSums(is.na(db2020))
## Department Department.Name Division Gender
## 0 0 0 0
## Base.Salary X2020.Overtime.Pay X2020.Longevity.Pay Grade
## 0 0 0 0
str(db2021)
## 'data.frame': 9907 obs. of 8 variables:
## $ Department : chr "ABS" "ABS" "ABS" "ABS" ...
## $ Department.Name : chr "Alcohol Beverage Services" "Alcohol Beverage Services" "Alcohol Beverage Services" "Alcohol Beverage Services" ...
## $ Division : chr "Beer Loading" "Liquor and Wine Delivery Operations" "Beer Delivery Operations" "Beer Delivery Operations" ...
## $ Gender : chr "M" "M" "M" "M" ...
## $ Base.Salary : num 87969 80086 80086 70814 76419 ...
## $ X2021.Overtime.Pay : num 32953 32656 31369 29838 21379 ...
## $ X2021.Longevity.Pay: num NA 1105 1591 NA NA ...
## $ Grade : chr "20" "18" "18" "18" ...
colSums(is.na(db2021))
## Department Department.Name Division Gender
## 0 0 0 0
## Base.Salary X2021.Overtime.Pay X2021.Longevity.Pay Grade
## 0 4312 7494 0
str(db2022)
## 'data.frame': 10052 obs. of 8 variables:
## $ Department : chr "ZAH" "ZAH" "ZAH" "ZAH" ...
## $ Department_Name: chr "Office of Zoning and Administrative Hearings" "Office of Zoning and Administrative Hearings" "Office of Zoning and Administrative Hearings" "Office of Zoning and Administrative Hearings" ...
## $ Division : chr "ZAH 05 Zoning and Administrative Hearings" "ZAH 05 Zoning and Administrative Hearings" "ZAH 05 Zoning and Administrative Hearings" "ZAH 05 Office of Zoning and Administrative Hearings" ...
## $ Gender : chr "M" "F" "F" "F" ...
## $ Base.Salary : num 142100 162000 93895 77180 196632 ...
## $ Overtime.Pay : num NA NA NA NA NA NA NA NA NA NA ...
## $ Longevity.Pay : num NA NA NA NA NA NA NA NA NA NA ...
## $ Grade : chr "N35" "N35" "N21" "N18" ...
colSums(is.na(db2022))
## Department Department_Name Division Gender Base.Salary
## 0 0 0 0 0
## Overtime.Pay Longevity.Pay Grade
## 4266 7653 0
db2019 <- db2019 %>%
rename("Overtime" = "X2019.Overtime.Pay",
"Longevity" = "X2019.Longevity.Pay",
"Department Name" = "Department.Name",
"Base Salary" = "Base.Salary") %>%
mutate("Year" = 2019)
db2020 <- db2020 %>%
rename("Overtime" = "X2020.Overtime.Pay",
"Longevity" = "X2020.Longevity.Pay",
"Department Name" = "Department.Name",
"Base Salary" = "Base.Salary") %>%
mutate("Year" = 2020)
db2021 <- db2021 %>%
rename("Overtime" = "X2021.Overtime.Pay",
"Longevity" = "X2021.Longevity.Pay",
"Department Name" = "Department.Name",
"Base Salary" = "Base.Salary") %>%
mutate("Year" = 2021)
db2022 <- db2022 %>%
rename("Overtime" = "Overtime.Pay",
"Longevity" = "Longevity.Pay",
"Department Name" = "Department_Name",
"Base Salary" = "Base.Salary") %>%
mutate("Year" = 2022)
db2019$Overtime <- sub(0,NA,db2019$Overtime)
db2020$Overtime <- sub(0,NA,db2020$Overtime)
db2021$Overtime <- sub(0,NA,db2021$Overtime)
db2022$Overtime <- sub(0,NA,db2022$Overtime)
db2019$Longevity <- sub(0,NA,db2019$Longevity)
db2020$Longevity <- sub(0,NA,db2020$Longevity)
db2021$Longevity <- sub(0,NA,db2021$Longevity)
db2022$Longevity <- sub(0,NA,db2022$Longevity)
colSums(is.na(db2019))
## Department Department Name Division Gender Base Salary
## 0 0 0 0 0
## Overtime Longevity Grade Year
## 6321 8494 0 0
colSums(is.na(db2020))
## Department Department Name Division Gender Base Salary
## 0 0 0 0 0
## Overtime Longevity Grade Year
## 6233 8485 0 0
colSums(is.na(db2021))
## Department Department Name Division Gender Base Salary
## 0 0 0 0 0
## Overtime Longevity Grade Year
## 6252 8416 0 0
colSums(is.na(db2022))
## Department Department Name Division Gender Base Salary
## 0 0 0 0 0
## Overtime Longevity Grade Year
## 6344 8340 0 0
Through our EDA process, we are looking to find hints to answer our initial prompts. We are also looking to find information that will support defining what the new “pre-covid state” would look like.
str(db2019)
## 'data.frame': 10105 obs. of 9 variables:
## $ Department : chr "BOA" "BOA" "BOA" "BOE" ...
## $ Department Name: chr "Board of Appeals Department" "Board of Appeals Department" "Board of Appeals Department" "Board of Elections" ...
## $ Division : chr "Board of Appeals Division" "Board of Appeals Division" "Executive Director" "Director" ...
## $ Gender : chr "F" "F" "F" "F" ...
## $ Base Salary : num 78902 58482 144751 183654 62488 ...
## $ Overtime : chr NA NA NA NA ...
## $ Longevity : chr NA NA NA NA ...
## $ Grade : chr "18" "16" "M3" "M1" ...
## $ Year : num 2019 2019 2019 2019 2019 ...
options(scipen = 999)
plot2019 <- ggplot(db2019, aes(`Department`, `Base Salary`))+
geom_boxplot() +
scale_fill_viridis(discrete = TRUE, alpha=0.6) +
theme(axis.text.x = element_text(angle = 45),
legend.position="none",
plot.title = element_text(size=11)
) +
ggtitle("2019 Salaries by Department") +
xlab("Department")
ggplotly(plot2019)
stat2019 <- db2019 %>%
group_by(Department) %>%
summarise("Total Salaries" = sum(`Base Salary`,na.rm=T),
"Average Salaries" = mean(`Base Salary`,na.rm=T),
"Count of Data Records" = n(),
"Lowest Salary" = min(`Base Salary`,na.rm=T),
"Highest Salary" = max(`Base Salary`,na.rm=T),
"Average Overtime Pay" = mean(`Overtime`, na.rm=T)) %>%
mutate("Salary Gap" = `Highest Salary` - `Lowest Salary`)
head(stat2019)
## # A tibble: 6 × 8
## Department `Total Salaries` Average …¹ Count…² Lowes…³ Highe…⁴ Avera…⁵ Salar…⁶
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 BOA 282135. 94045. 3 58482. 144751 NA 86269.
## 2 BOE 2387357. 79579. 30 37506 183654. NA 146148.
## 3 CAT 9127798 115542. 79 20470. 210143 NA 189673.
## 4 CCL 10297327. 100954. 102 31441. 206000 NA 174559.
## 5 CEC 5816075. 70073. 83 20718. 186735. NA 166017.
## 6 CEX 5598448. 124410. 45 16732. 303091 NA 286359.
## # … with abbreviated variable names ¹`Average Salaries`,
## # ²`Count of Data Records`, ³`Lowest Salary`, ⁴`Highest Salary`,
## # ⁵`Average Overtime Pay`, ⁶`Salary Gap`
stat2019compare <- stat2019 %>%
ggplot(aes(x=`Average Salaries`,
y=`Salary Gap`)) +
geom_point(aes(text = paste("Department: ",`Department`))) +
geom_smooth(method=lm , se = T) +
theme_minimal()
ggplotly(stat2019compare + labs(title = "2019 Average Salary Versus Salary Gap"))
str(db2020)
## 'data.frame': 9958 obs. of 9 variables:
## $ Department : chr "ABS" "ABS" "ABS" "ABS" ...
## $ Department Name: chr "Alcohol Beverage Services" "Alcohol Beverage Services" "Alcohol Beverage Services" "Alcohol Beverage Services" ...
## $ Division : chr "Wholesale Administration" "Administrative Services" "Administration" "Wholesale Operations" ...
## $ Gender : chr "F" "F" "M" "F" ...
## $ Base Salary : num 78902 35926 167345 90848 78902 ...
## $ Overtime : chr "199.17" NA NA NA ...
## $ Longevity : chr NA NA NA "5717.68" ...
## $ Grade : chr "18" "16" "M2" "21" ...
## $ Year : num 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
options(scipen = 999)
plot2020 <- ggplot(db2020, aes(`Department`, `Base Salary`))+
geom_boxplot() +
scale_fill_viridis(discrete = TRUE, alpha=0.6) +
theme(axis.text.x = element_text(angle = 45),
legend.position="none",
plot.title = element_text(size=11)
) +
ggtitle("2020 Salaries by Department") +
xlab("Department")
ggplotly(plot2020)
stat2020 <- db2020 %>%
group_by(Department) %>%
summarise("Total Salaries" = sum(`Base Salary`,na.rm=T),
"Average Salaries" = mean(`Base Salary`,na.rm=T),
"Count of Data Records" = n(),
"Lowest Salary" = min(`Base Salary`,na.rm=T),
"Highest Salary" = max(`Base Salary`,na.rm=T),
"Average Overtime Pay" = mean(`Overtime`, na.rm=T)) %>%
mutate("Salary Gap" = `Highest Salary` - `Lowest Salary`)
head(stat2020)
## # A tibble: 6 × 8
## Department `Total Salaries` Average …¹ Count…² Lowes…³ Highe…⁴ Avera…⁵ Salar…⁶
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 ABS 24559238. 56071. 438 18574. 220000 NA 201426.
## 2 BOA 282135. 94045. 3 58482. 144751 NA 86269.
## 3 BOE 2373064. 81830. 29 37881. 183654. NA 145773.
## 4 CAT 8547934. 115513. 74 20470. 210143 NA 189673.
## 5 CCL 9442174. 95375. 99 30000 206000 NA 176000
## 6 CEC 5766979. 69482. 83 22971. 186735. NA 163765.
## # … with abbreviated variable names ¹`Average Salaries`,
## # ²`Count of Data Records`, ³`Lowest Salary`, ⁴`Highest Salary`,
## # ⁵`Average Overtime Pay`, ⁶`Salary Gap`
stat2020compare <- stat2020 %>%
ggplot(aes(x=`Average Salaries`,
y=`Salary Gap`)) +
geom_point(aes(text = paste("Department: ",`Department`))) +
geom_smooth(method=lm , se = T) +
theme_minimal()
ggplotly(stat2020compare + labs(title = "2020 Average Salary Versus Salary Gap"))
str(db2021)
## 'data.frame': 9907 obs. of 9 variables:
## $ Department : chr "ABS" "ABS" "ABS" "ABS" ...
## $ Department Name: chr "Alcohol Beverage Services" "Alcohol Beverage Services" "Alcohol Beverage Services" "Alcohol Beverage Services" ...
## $ Division : chr "Beer Loading" "Liquor and Wine Delivery Operations" "Beer Delivery Operations" "Beer Delivery Operations" ...
## $ Gender : chr "M" "M" "M" "M" ...
## $ Base Salary : num 87969 80086 80086 70814 76419 ...
## $ Overtime : chr "32953.25" NA "31369.15" "29837.85" ...
## $ Longevity : chr NA NA "1591.2" NA ...
## $ Grade : chr "20" "18" "18" "18" ...
## $ Year : num 2021 2021 2021 2021 2021 ...
options(scipen = 999)
plot2021 <- ggplot(db2021, aes(`Department`, `Base Salary`))+
geom_boxplot() +
scale_fill_viridis(discrete = TRUE, alpha=0.6) +
theme(axis.text.x = element_text(angle = 45),
legend.position="none",
plot.title = element_text(size=11)
) +
ggtitle("2021 Salaries by Department") +
xlab("Department")
ggplotly(plot2021)
stat2021 <- db2021 %>%
group_by(Department) %>%
summarise("Total Salaries" = sum(`Base Salary`,na.rm=T),
"Average Salaries" = mean(`Base Salary`,na.rm=T),
"Count of Data Records" = n(),
"Lowest Salary" = min(`Base Salary`,na.rm=T),
"Highest Salary" = max(`Base Salary`,na.rm=T),
"Average Overtime Pay" = mean(`Overtime`, na.rm=T)) %>%
mutate("Salary Gap" = `Highest Salary` - `Lowest Salary`)
head(stat2021)
## # A tibble: 6 × 8
## Department `Total Salaries` Average …¹ Count…² Lowes…³ Highe…⁴ Avera…⁵ Salar…⁶
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 ABS 26083216. 57579. 453 18574. 220000 NA 201426.
## 2 BOA 290596 96865. 3 63587 146923 NA 83336
## 3 BOE 2243380. 86284. 26 41703. 189951. NA 148248.
## 4 CAT 9230528. 121454. 76 22202 220000 NA 197798
## 5 CCL 10845916. 99504. 109 17963 209090 NA 191127
## 6 CEC 6866033. 73043. 94 24131. 189536. NA 165405.
## # … with abbreviated variable names ¹`Average Salaries`,
## # ²`Count of Data Records`, ³`Lowest Salary`, ⁴`Highest Salary`,
## # ⁵`Average Overtime Pay`, ⁶`Salary Gap`
stat2021compare <- stat2021 %>%
ggplot(aes(x=`Average Salaries`,
y=`Salary Gap`)) +
geom_point(aes(text = paste("Department: ",`Department`))) +
geom_smooth(method=lm , se = T) +
theme_minimal()
ggplotly(stat2021compare + labs(title = "2021 Average Salary Versus Salary Gap"))
We see that the 2021 has a lot of na values in the overtime and longevity pay columns. Taking a look into the dataset, we see that it is due to being blank values, rather than entering 0. In the next set we will clean up the dataset.
str(db2022)
## 'data.frame': 10052 obs. of 9 variables:
## $ Department : chr "ZAH" "ZAH" "ZAH" "ZAH" ...
## $ Department Name: chr "Office of Zoning and Administrative Hearings" "Office of Zoning and Administrative Hearings" "Office of Zoning and Administrative Hearings" "Office of Zoning and Administrative Hearings" ...
## $ Division : chr "ZAH 05 Zoning and Administrative Hearings" "ZAH 05 Zoning and Administrative Hearings" "ZAH 05 Zoning and Administrative Hearings" "ZAH 05 Office of Zoning and Administrative Hearings" ...
## $ Gender : chr "M" "F" "F" "F" ...
## $ Base Salary : num 142100 162000 93895 77180 196632 ...
## $ Overtime : chr NA NA NA NA ...
## $ Longevity : chr NA NA NA NA ...
## $ Grade : chr "N35" "N35" "N21" "N18" ...
## $ Year : num 2022 2022 2022 2022 2022 ...
options(scipen = 999)
plot2022 <- ggplot(db2022, aes(`Department`, `Base Salary`))+
geom_boxplot() +
scale_fill_viridis(discrete = TRUE, alpha=0.6) +
theme(axis.text.x = element_text(angle = 45),
legend.position="none",
plot.title = element_text(size=11)
) +
ggtitle("2022 Salaries by Department") +
xlab("Department")
ggplotly(plot2022)
stat2022 <- db2022 %>%
group_by(Department) %>%
summarise("Total Salaries" = sum(`Base Salary`,na.rm=T),
"Average Salaries" = mean(`Base Salary`,na.rm=T),
"Count of Data Records" = n(),
"Lowest Salary" = min(`Base Salary`,na.rm=T),
"Highest Salary" = max(`Base Salary`,na.rm=T),
"Average Overtime Pay" = mean(`Overtime`, na.rm=T)) %>%
mutate("Salary Gap" = `Highest Salary` - `Lowest Salary`)
head(stat2022)
## # A tibble: 6 × 8
## Department `Total Salaries` Average …¹ Count…² Lowes…³ Highe…⁴ Avera…⁵ Salar…⁶
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 ABS 26744480. 60100. 445 18854. 220000 NA 201146.
## 2 BOA 297932. 99311. 3 67555. 148607 NA 81052.
## 3 BOE 2391333. 85405. 28 45114. 170000 NA 124886.
## 4 CAT 9517404. 123603. 77 23044 236000 NA 212956
## 5 CCL 12874837. 106404. 121 32842 225725. NA 192883.
## 6 CEC 7081281. 70813. 100 21500 185000 NA 163500
## # … with abbreviated variable names ¹`Average Salaries`,
## # ²`Count of Data Records`, ³`Lowest Salary`, ⁴`Highest Salary`,
## # ⁵`Average Overtime Pay`, ⁶`Salary Gap`
stat2022compare <- stat2022 %>%
ggplot(aes(x=`Average Salaries`,
y=`Salary Gap`)) +
geom_point(aes(text = paste("Department: ",`Department`))) +
geom_smooth(method=lm , se = T) +
theme_minimal()
ggplotly(stat2022compare + labs(title = "2022 Average Salary Versus Salary Gap"))
mergedSalary <- do.call("rbind",list(db2019,db2020,db2021,db2022))
statAllYear <- mergedSalary %>%
group_by(Department, Year) %>%
summarise("Total Salaries" = sum(`Base Salary`,na.rm=T),
"Average Salaries" = mean(`Base Salary`,na.rm=T),
"Count of Data Records" = n(),
"Lowest Salary" = min(`Base Salary`,na.rm=T),
"Highest Salary" = max(`Base Salary`,na.rm=T),
"Average Overtime Pay" = mean(`Overtime`, na.rm=T)) %>%
mutate("Salary Gap" = `Highest Salary` - `Lowest Salary`)
head(statAllYear)
## # A tibble: 6 × 9
## # Groups: Department [2]
## Department Year Total Salar…¹ Avera…² Count…³ Lowes…⁴ Highe…⁵ Avera…⁶ Salar…⁷
## <chr> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 ABS 2020 24559238. 56071. 438 18574. 220000 NA 201426.
## 2 ABS 2021 26083216. 57579. 453 18574. 220000 NA 201426.
## 3 ABS 2022 26744480. 60100. 445 18854. 220000 NA 201146.
## 4 BOA 2019 282135. 94045. 3 58482. 144751 NA 86269.
## 5 BOA 2020 282135. 94045. 3 58482. 144751 NA 86269.
## 6 BOA 2021 290596 96865. 3 63587 146923 NA 83336
## # … with abbreviated variable names ¹`Total Salaries`, ²`Average Salaries`,
## # ³`Count of Data Records`, ⁴`Lowest Salary`, ⁵`Highest Salary`,
## # ⁶`Average Overtime Pay`, ⁷`Salary Gap`
statOnlyYear <- mergedSalary %>%
group_by(Year) %>%
summarise("Total Salaries" = sum(`Base Salary`,na.rm=T),
"Average Salaries" = mean(`Base Salary`,na.rm=T),
"Count of Data Records" = n(),
"Lowest Salary" = min(`Base Salary`,na.rm=T),
"Highest Salary" = max(`Base Salary`,na.rm=T),
"Average Overtime Pay" = mean(`Overtime`, na.rm=T)) %>%
mutate("Salary Gap" = `Highest Salary` - `Lowest Salary`)
# Comparison plots
compareAllp1 <- statAllYear %>%
ggplot(aes(x = `Average Salaries`, y = `Salary Gap`)) +
geom_point(aes(size = `Count of Data Records`,
color = Department,
alpha = 0.6))+
theme_minimal() +
geom_smooth(method = lm, se = T)+
facet_wrap(~Year,nrow=2)+
theme(legend.position='none')+
scale_x_continuous(labels = scales::dollar_format())+
scale_y_continuous(labels = scales::dollar_format())+
labs(title="Salary Gap Versus Average Salary by Year")
ggplotly(compareAllp1)
# Create average for each department year
mergedSalaryFinal <- merge(mergedSalary, statAllYear, by = c("Department","Year"))
compareAllp2 <- mergedSalaryFinal %>%
ggplot(aes(x = `Base Salary`, y = `Average Salaries`)) +
geom_point(aes(color = Department,
alpha = 0.6))+
theme_minimal() +
facet_wrap(~Year,nrow=2)+
theme(legend.position='none')+
scale_x_continuous(labels = scales::dollar_format())+
scale_y_continuous(labels = scales::dollar_format())+
labs(title="Base Salary Versus Average Salary Per Year")
# ggplotly(compareAllp2) # too much lag to actual run due to number of items
compareAllp2
# Create average for year
cor(mergedSalaryFinal$`Base Salary`, mergedSalaryFinal$`Average Salaries`)
## [1] 0.4388825
model <- lm(`Average Salaries`~`Base Salary`, data = mergedSalaryFinal)
summary(model)
##
## Call:
## lm(formula = `Average Salaries` ~ `Base Salary`, data = mergedSalaryFinal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51815 -6956 -281 5610 78351
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65510.572193 170.977633 383.15 <0.0000000000000002 ***
## `Base Salary` 0.192618 0.001971 97.71 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12090 on 40020 degrees of freedom
## Multiple R-squared: 0.1926, Adjusted R-squared: 0.1926
## F-statistic: 9548 on 1 and 40020 DF, p-value: < 0.00000000000000022
From our linear regression model between the average salaries and base salary, we find our formula to be:
\[ Average Salaries = 65510.57 + 0.192618 * Base Salary \]
growthRate = statOnlyYear %>%
arrange(Year) %>%
mutate(diffYear = Year - lag(Year),
diffGrowth = `Average Salaries` - lag(`Average Salaries`),
ratePerc= (diffGrowth / diffYear)/lag(`Average Salaries`) * 100)
AvgGrowth = mean(growthRate$ratePerc,na.rm = TRUE)
print(paste("Our average salaries are increasing at a", round(AvgGrowth,2), "% rate."))
## [1] "Our average salaries are increasing at a 2.72 % rate."
# Do some graphs
yearAvgSalPlot <- statOnlyYear %>%
ggplot(aes(x = Year, y = `Average Salaries`)) +
geom_line() +
geom_point() +
theme_minimal() +
labs(title="Trend of Total Average salaries")
ggplotly(yearAvgSalPlot)
| Year | Average Salary | Percentage Change |
|---|---|---|
| 2019 | $78,936.02 | |
| 2020 | $78,771.46 | -0.208% |
| 2021 | $81,339.53 | 3.260% |
| 2022 | $85,503.28 | 5.119% |
mu <- mergedSalaryFinal %>%
group_by(Gender) %>%
summarise("grp.mean" = mean(`Average Salaries`,na.rm=T))
yearHistGenderPlot <- mergedSalaryFinal %>%
ggplot(aes(x = `Average Salaries`, color = Gender)) +
geom_histogram(fill = "White",position = "dodge", alpha = 0.5) +
geom_vline(data = mu, aes(xintercept=grp.mean,color=Gender),
linetype = "dashed") +
facet_wrap(~Year,nrow = 2)
ggplotly(yearHistGenderPlot)
Here we can see in each year, on average females had a higher average salary as compared to their male counterparts.
Due to some limitations with the datasets, we want to pull in outside resources to support statistical analysis. Some of the data points we will be able to use are US Inflation rates, Montgomery County unemployment rates, amongst others.
# Discuss changes in employment counts/avg/std by department
# Pull in inflation data
inflation <- data.frame(year=c("2019","2020","2021","2022"),
rate=c(0.023,0.014,0.07,0.065))
# Unemployment Rate for Montgomery County provided by U.S. Bureau of Labor Statistics (BLS)
unemploymentRate <- data.frame(year=c("2019","2020","2021","2022"),
rate=c(0.027,0.054,0.042,0.028))
# discuss Maryland rules on work from home and some timings on major rulings
# Break out of departments that may have some interest
# CEX <- Has the largest salary gap
# IGR <- highest average salary
# LIB HRC TBS <- this group has single outlier on top
# POL DOT <- Has large amount of outliers above and below
# Add statistical analysis
# ML?
# Bayesian Stats?
# Mann-Whitney U Tests?
Luh, Kathy (2020). Employee Salaries - 2019, dataMontgomery. https://data.montgomerycountymd.gov/Human-Resources/Employee-Salaries-2019/qatd-z57d
Luh, Kathy (2021). Employee Salaries - 2020, dataMontgomery. https://data.montgomerycountymd.gov/Human-Resources/Employee-Salaries-2020/he7s-ebwb
Luh, Kathy (2022). Employee Salaries - 2021, dataMontgomery. https://data.montgomerycountymd.gov/Human-Resources/Employee-Salaries-2021/kmkb-bmhe
Luh, Kathy (2023). Employee Salaries - 2022, dataMontgomery. https://data.montgomerycountymd.gov/Human-Resources/Employee-Salaries-2022/njz9-yp4y
U.S. Bureau of Labor Statistics, Unemployment Rate in Montgomery County, MD [MDMONT0URN], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/MDMONT0URN, March 22, 2023.
“CPI Inflation Calculator.” U.S. Bureau of Labor Statistics, https://www.bls.gov/data/inflation_calculator.htm, March 22, 2023.